E03 - qPCR (Analysis of Gene Expression @ UCT Prague)
Copy, please, these files and directories to your personal directory:
cp -r ~/shared/AGE2021/Exercises/{E03-qPCR,age_library.R} ~/AGE2021/ExercisesAssignment
The assignment for this exercise is to implement (mainly visualization) functions, which you will use later on different types of gene expression data. You will implement these functions in age_library.R where are empty function definitions. You donβt have to stick to all predefined parameters and returned variables, just keep in mind that your implementations should make outputs similar to outputs in this document. Also, in several cases there are alternative functions (such as plot_pca() and plot_pca_ggpairs()), but you donβt have to implement them all, if not said.
Sections containing functions to be implemented are marked with π Anyway, here is the minimum set of functions you need to implement (arguments are up to you, itβs about the output):
plot_hc(m, color_by): Calculate hiearchical clustering of matrixmcolumns and plot a dendrogram, whose labels are colored according tocolor_byvector.plot_pca(m, sample_data, color_by): Calculate PCA of matrixmand make a scatterplot of first two principal compoments with points colored bycolor_bycolumn insample_data.plot_heatmap(m, column_annotation, z_score = FALSE): Make a heatmap of matrixmwith column annotation specified incolumn_annotationdataframe. Optionally, convert values inmto z-score (row-wise).plot_boxplots(plot_data, x, y): Using longplot_data, plot boxplots of values inycolumn, splitted to groups defined inxcolumn.compute_m(gene, cp): Compute the M value of \(C_P\) values (geNorm algorithm).test_gene(gene, gene_data, gene_col, value_col, group_col): Using longgene_data, calculate between group statistical test (e.g.Β t-test) ofgenedefined ingene_colcolumn with values invalue_colcolumn, and groups ingroup_colcolumn.test_gene_table(gene_data, gene_col, value_col, group_col): The same astest_gene(), but calculate tests for all genes ingene_colcolumn and return a dataframe of results.
After you implement these functions, modify their calling in this document and render it again. Obviously, you need to remove sections with functions you havenβt implemented.
Please, return the assignment in this form:
- In the header, put your name and e-mail to
authorparameter (underparams). - Copy all finished files to directory named
E03-qPCR_assignment_<your_name>: modified Rmd, rendered HTML,age_library.Rwith your implementations. Put this directory to ZIP archive namedE03-qPCR_assignment_<your_name>.zip.- In rendered HTML there should be included all code chunks with full output (warnings, messages, etc.).
- Upload the ZIP file to the appropriate assignment in MS Teams.
Note that your functions will be universal, serving not only for qPCR data. This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.), where input (expression matrix) is same also for microarrays and RNA-seq data.
Introduction
Quantitative real-time PCR monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.
qPCR measurement
In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold \(S_T\) and measure crossing points \(C_P\) (also called threshold cycle \(C_T\)) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating \(\Delta C_P = C_P^t - C_P^r\). Obtained \(\Delta C_P\) values are then used to calculate the relative change of gene expression between samples. In this case, we are interested how much more or less is gene expressed in \(\text{treatment}\) sample (\(\text{trt}\)), relative to \(\text{control}\) (\(\text{cnt}\)) sample. This relative change is called \(\text{fold-change}\): \(FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}\). We usually report \(FC\) in \(log_2\) scale: \(LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)\) Important is to realize that the higher \(C_P\) is, the lower initial mRNA concentration was. This is why we put minus /\(-\)/ in front of \(\Delta\Delta C_P\). We usually donβt consider the efficiency of PCR and expect the mRNA concentration to double in each cycle. Otherwise \(FC = (1+E)^{-\Delta\Delta C_P}\), where \(E\) is the PCR efficiency.
Our experiment
We will work with data from pancreatic tumor experiment where an influence of spirulina algae was investigated. In this experiment there are two sample groups:
- control: samples from tumor.
- spirulina: samples from tumor treated by extract from spirulina.
Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh, a nested design, rightβ¦
But we can expand it to:
- control group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample C1_C50_R1
| | |-- technical replicate #2 -> sample C2_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample C1_C90_R1
| |-- technical replicate #2 -> sample C1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample C2_C50_R1
| |-- technical replicate #2 -> sample C2_C50_R2
|-- confluence 90
|-- ...
- spirulina group
|
|-- biological replicate #1 (Petri dish #1)
| |-- confluence 50
| | |-- technical replicate #1 -> sample S1_C50_R1
| | |-- technical replicate #2 -> sample S1_C50_R2
| |-- confluence 90
| |-- technical replicate #1 -> sample S1_C90_R1
| |-- technical replicate #2 -> sample S1_C90_R2
|-- biological replicate #2 (Petri dish #2)
|-- confluence 50
| |-- technical replicate #1 -> sample S2_C50_R1
| |-- technical replicate #2 -> sample S2_C50_R2
|-- confluence 90
|-- ...
Also, there are two groups of measured genes:
- Target genes - interesting genes found prior by microarray exploratory analysis.
- Housekeeping genes - used as reference genes.
Libraries
library(here)
library(tidyverse)
library(dplyr)
library(glue)
library(magrittr)Now we source your personal library:
source(here("age_library.R"))Config
I would recommend you to always define constants at the beginning of a script. It is more transparent, and if you, for example, change some path, you havenβt to replace it in whole source code. Constants are, by convention, named in UPPERCASE.
BASE_DIR <- here("E03-qPCR")
CP_DATA <- here(BASE_DIR, "data/qPCR.Rds")Data preprocessing
\(C_P\) matrix
We have already preprocessed a \(C_P\) matrix for you. But if you really want to start from the floor, you can create the \(C_P\) matrix from original files (here and here). Not very pretty, right? π€
In \(C_P\) matrix, columns are samples, rows are genes, and values are \(C_P\) values. This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-seq.
cp <- readRDS(CP_DATA) %>%
as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)
cp[1:5, 1:5]CP values >= 40 are considered out of qPCR limit and so we set them to NA:
cp[cp > 39.9] <- NASome genes have so many NA values (respectively, they donβt have many values other than NA):
lq_genes <- rowSums(!is.na(cp))
lq_genes## CD24 CD31 CXCL12 CXCR4 CXCR7 FLT1 VEGFA VEGRF2 ACTB H4 GAPDH H5 RPS9 H11 TBP H26
## 24 19 2 10 0 0 24 2 24 24 24 24
Letβs remove genes with number of non-NA values less or equal to 2:
lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]Now we split genes to target and reference ones. The latter have H<number> on the end of their names, which we can match with regular expression (regex) H\d+$.
ref_genes <- str_subset(genes, "H\\d+$")
target_genes <- setdiff(genes, ref_genes)
gene_groups <- if_else(genes %in% ref_genes, "reference", "target")
(genes_df <- data.frame(
gene = as.character(genes),
gene_group = factor(gene_groups, levels = c("reference", "target")),
row.names = genes,
stringsAsFactors = FALSE)
)For plotting purposes, we create a new dataframe with NAs replaced by 40:
cp_plot <- replace(cp, is.na(cp), 40)Phenotypical data (sample sheet)
Along with measured data you usually obtain a sample sheet, but in this case we are going to parse the column names of \(C_P\) matrix. Everything needed is contained there:
- K and SP: sample groups
- Number behind sample group: Petri dish (= biological replicate)
- Number behind slash (β/β): confluence
- (2RT): second technical replicate
str_match() is returning a character matrix of capturing groups (columns) for each string (rows). The first column is always the full match.
(samples <- colnames(cp))## [1] "K1/50" "K1/50(2RT)" "K1/90" "K1/90 (2RT)" "K2/50" "K2/50(2RT)" "K2/90" "K2/90 (2RT)" "K3/50" "K3/50(2RT)" "K3/90" "K3/90 (2RT)"
## [13] "SP1/50" "SP1/50 (2RT)" "SP1/90" "SP1/90 (2RT)" "SP2/50" "SP2/50 (2RT)" "SP2/90" "SP2/90 (2RT)" "SP3/50" "SP3/50(2RT)" "SP3/90" "SP3/90 (2RT)"
pheno_data <- data.frame(
sample_id = samples,
# Match with a single capturing group (full match): "SP" or "K" on the beginning.
sample_group_code = str_match(samples, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
# Match a sample group on the beginning and then the number behind (second group).
petri_number = str_match(samples, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
# Match a two-digit number behind "/".
confluence = str_match(samples, "\\/(\\d{2})")[, 2] %>% as.factor(),
# Match a technical replicate number before "RT". See how literal match of parentheses ("(" and ")") must be written:
# first you have to escape backslash in R as "\\" to escape parenthese for regex as "\\(", which will match the character "(" literally,
# e.g. not specifying the capturing group.
# Note that first technical replicates doesn't have this part, so their match will be NA, which we replace by "1".
replicate = str_match(samples, "\\((\\d)RT\\)")[, 2] %>% replace_na("1") %>% as.factor()
) %>%
dplyr::mutate(
sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
) %>%
dplyr::select(sample_name_rep, sample_name, everything()) %>%
set_rownames(.$sample_name_rep)
colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)
head(pheno_data)If you are not sure what regexes above mean, go to regex101.com and try them. You can place multiple sample names on separate lines. In regexes you just have to replace double backslashes (\\) by a single one. Thatβs because backslash is used in R (and other programming languages) for escaping. Shortly, character after backslash has a special meaning (e.g.Β \n is used for newline) and because regexes themselves use backslashes (e.g.Β \d means βany digitβ), you have to tell R to treat backslash as normal character, and not as the start of escaping.
Example from regex101.com
Long data
We also create long data from cp, pheno_data and genes_df:
cp_long <- as.data.frame(cp) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
dplyr::left_join(genes_df, by = "gene") %>%
dplyr::mutate(
cp_plot = if_else(is.na(cp), 40, cp)
) %>%
dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())
head(cp_long)Exploratory analysis on raw data
We are using exploratory analysis to have an unbiased first look at data. It is also a crucial step to assess the biological quality control, e.g.Β whether samples from distinct groups create separate clusters - if not, samples might be swapped etc.
Hiearchical clustering π
Function to be implemented within the assignment. It takes an expression matrix and returns a dendrogram with coloring by a chosen variable. I recommend you to use the dendextend package to modify a dendrogram, it is much easier than doing so in base R.
coloring_variables <- c("sample_group", "confluence", "petri_number")
for (variable in coloring_variables) {
plot_hc(
cp_plot, color_by = pheno_data[, variable],
method_distance = "euclidean", method_clustering = "complete",
color_by_lab = variable
)
}We can see the main effect - with several exceptions, samples are clustered by sample groups, which is good. There isnβt any effect of other variables, which would be confounding.
PCA π
Function to be implemented within the assignment. It will perform PCA visualization with point coloring by a chosen variable. You can also output plots of first N principal components and variance explained bar plot.
We can see that the first principal component (PC1) is nicely separating our samples by sample group:
plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plotplot_pca(cp_plot, pheno_data, plot_type = "multi", color_by = "sample_group", shape_by = "confluence")$plotplot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plotplot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot- You can also use
GGally::ggpairs()to plot a grid of PCA plots (PC1 vs.Β PC2, PC1 vs.Β PC3, PC2 vs.Β PC3, etc.):
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plotHeatmap π
Function to be implemented within the assignment.
ComplexHeatmappackage is preferred. For interactive heatmaps you can use heatmaply (introduced in E02-intro to R).
p_heatmap <- plot_heatmap(
cp,
column_annotation = dplyr::select(pheno_data, `Sample Group` = sample_group, `Petri Dish` = petri_number, Confluence = confluence, Replicate = replicate),
row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group),
legend_title = "Cp",
show_column_names = FALSE
)
draw(p_heatmap, merge_legend = TRUE)Can you see any outlying samples (replicates)? If yes, should they be removed?
Selecting the reference genes
We will use housekeeping genes as reference genes. Housekeeping genes are genes which have similar expression profile regardless the cell type or state. That is why they can be used for sample normalization (as an internal control).
Clustering
Housekeeping genes should cluster together. Letβs try it using hierachical clustering:
plot_hc(t(cp), color_by = gene_groups, color_by_lab = "Gene groups")Also in PCA, we should see them separated from target genes:
plot_pca(t(cp_plot), sample_data = genes_df, color_by = "gene_group")$plotNot very sharp clustering, but PC1 separates the gene groups well.
Correlation
Expression of housekeeping genes is/should be correlated.
main_title <- "Raw data correlation"
t(cp) %>%
as.data.frame() %>%
GGally::ggpairs(progress = FALSE) +
theme_bw()Letβs look more closely at the correlation coefficients:
(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))## CD24 CD31 CXCR4 VEGFA ACTB H4 GAPDH H5 RPS9 H11 TBP H26
## CD24 1.00000000 0.03970592 -0.28809768 0.73649418 0.73547823 0.73141005 0.76319246 0.72419041
## CD31 0.03970592 1.00000000 -0.73606175 0.03747391 0.03172092 0.17532289 0.10746231 0.08438151
## CXCR4 -0.28809768 -0.73606175 1.00000000 -0.20149851 -0.18053811 -0.02185175 0.08141219 -0.25461239
## VEGFA 0.73649418 0.03747391 -0.20149851 1.00000000 0.77468336 0.82725327 0.83395485 0.77534719
## ACTB H4 0.73547823 0.03172092 -0.18053811 0.77468336 1.00000000 0.92233124 0.89743784 0.93865311
## GAPDH H5 0.73141005 0.17532289 -0.02185175 0.82725327 0.92233124 1.00000000 0.96305770 0.93496580
## RPS9 H11 0.76319246 0.10746231 0.08141219 0.83395485 0.89743784 0.96305770 1.00000000 0.90135469
## TBP H26 0.72419041 0.08438151 -0.25461239 0.77534719 0.93865311 0.93496580 0.90135469 1.00000000
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
ggtitle(main_title)You can also use the heatmaply package:
plot_heatmaply(gene_cor, row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group), main = main_title, legend_title = "Correlation")At the first sight, housekeeping (reference) genes are well correlated, with correlation coefficients over 0.9
Housekeeping genes should cluster together also by their correlation coefficients. We can try both Euclidean and Pearson distance (1 - correlation coefficient) for hiearchical clustering.
plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")PCA of correlation coefficients is also helpful:
plot_pca(gene_cor, sample_data = genes_df, color_by = "gene_group")$plotFrom both hiearchical clustering and PCA you can see that reference genes make a compact cluster, which is good.
geNorm M-values π
Function (
compute_m()) to be implemented within the assignment. Your functions should output values similar to the below ones (or at least numerically very close to them).
Until now we were doing rather exploratory analysis of reference genes, but itβs time to make a more rigorous decision. M values are telling you how stable is a gene expressed across the samples. From the original paper:
βThis measure relies on the principle that the expression ratio of two ideal internal control genes is identical in all samples, regardless of the experimental condition or cell type. In this way, variation of the expression ratios of two real-life housekeeping genes reflects the fact that one (or both) of the genes is (are) not constantly expressed, with increasing variation in ratio corresponding to decreasing expression stability.β
For a further explanation see the section βGene-stability measure and ranking of selected housekeeping genesβ in the paper above.
Hints for geNorm implementation:
- Remember that R is vectorized. In this way you can directly write most of equations as you see them on paper.
- You will most likely need the
apply()function.
Letβs compute the M values:
purrr::map_dbl(ref_genes, ~ compute_m(., cp_plot[ref_genes, ])) %>%
set_names(ref_genes)## ACTB H4 GAPDH H5 RPS9 H11 TBP H26
## 0.5875240 0.6759988 0.7351183 0.5453682
All reference genes seem to be OK. But compare them with target genes - they would not be good reference genes:
purrr::map_dbl(target_genes, ~ compute_m(., cp_plot[target_genes, ])) %>%
set_names(target_genes)## CD24 CD31 CXCR4 VEGFA
## 1.622029 2.503237 2.180548 1.592555
Normalization to reference genes: \(\Delta C_P\)
For each sample we calculate a geometric mean of reference genes. Donβt be confused, itβs just this formula applied to reference genes of each sample.
norm <- apply(cp[ref_genes, ], 2, function(x) {
log(x) %>% mean() %>% exp()
})
head(norm)## c1_50_r1 c1_50_r2 c1_90_r1 c1_90_r2 c2_50_r1 c2_50_r2
## 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331
Letβs look how these geometric means of reference genes behave. As you can see, they lie in the centre of reference genes, and thatβs what we want (and expect).
gene_cor_norm <- rbind(cp, norm = norm) %>%
t() %>%
cor(use = "pairwise.complete.obs")
genes_df_norm <- genes_df %>%
dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
rbind(norm = c("norm", "norm"))
ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
ggtitle(main_title)plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")plot_pca(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plotNow we calculate \(\Delta C_P\) values by subtracting the reference gene means from target genes. First, we create a matrix of the same size as cp, with reference gene mean of each sample in columns.
ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE)
ref_gene_cp_matrix[1:5, 1:5]## [,1] [,2] [,3] [,4] [,5]
## [1,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [2,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [3,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [4,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [5,] 23.69656 23.90002 24.05644 24.08679 23.30099
Now we are ready for subtraction, that is, to calculate \(\Delta C_P = C_P^t - C_{P,mean}^r\):
delta_cp <- cp - ref_gene_cp_matrix
delta_cp[1:5, 1:5]Now we average the technical replicates. We will use a handy function from limma package, which we will use later for microarray data analysis. avearrays() is basically doing this: splits columns to groups, calculates within group average of each gene and concatenates the results to matrix or dataframe. We split our samples by sample_name column in pheno_data. You can see that for each unique sample_name, there is either replicate 1 or 2.
head(pheno_data)delta_cp_avg <- limma::avearrays(delta_cp, pheno_data$sample_name)
delta_cp_avg[1:5, 1:5]## c1_50 c1_90 c2_50 c2_90 c3_50
## CD24 8.09171 8.063385 8.59785 8.619696 8.614854
## CD31 12.29171 10.143385 11.40785 11.369696 10.450404
## CXCR4 12.12998 NaN 13.63669 13.757579 14.114854
## VEGFA 9.18171 8.678385 9.46285 8.759696 8.794854
## ACTB H4 1.59171 1.503385 1.78785 1.944696 1.879854
Just for control:
(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
all() %>%
stopifnot()Letβs finalize the \(\Delta C_P\) calculation. We keep only target genes and again, create a separate matrix for plotting purposes. We also add \(\Delta C_P\) to our long data.
delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))Remove replicates from pheno_data and cp_long:
pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything()) %>%
set_rownames(.$sample_name)
cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
dplyr::select(sample_name, everything())And add computed \(\Delta C_P\) to long data:
cp_long <- as.data.frame(delta_cp_avg) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
cp_long <- as.data.frame(delta_cp_avg_plot) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
dplyr::right_join(cp_long, by = c("sample_name", "gene"))
cp_long_target <- dplyr::filter(cp_long, gene_group == "target")
head(cp_long_target)As you can see, the code above is quite repetitive - better would be to write a function for that π
Exploratory analysis on summarised and normalised data
Now when our data are normalized, we can check for sample outliers, as these could mess up differential expression analysis.
plot_pca(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name")$plotfor (variable in coloring_variables) {
plot_hc(
delta_cp_avg_plot,
color_by = pheno_data_avg[, variable],
method_distance = "euclidean",
method_clustering = "complete",
color_by_lab = variable
)
}plot_heatmap(delta_cp_avg_plot, column_annotation = pheno_data_avg[, coloring_variables], title = "qPCR", legend_title = "deltaCp") %>%
draw(merge_legend = TRUE)We can see that samples c1_90, sp3_50 and sp1_90 are a little bit suspicious. Stricter people would consider their removal, but that is the last solution if final results are not satisfactory. In biology, one has to always keep in mind its natural variation.
Calculating (\(log_2\)) fold-changes: \(-\Delta\Delta C_P\)
Now we calculate the relative change in gene expression between sample groups. For the sake of simplicity, we donβt take into account other variables, e.g.Β confluence. Also from the exploratory analysis you can see that these variables donβt have much effect on the separation of samples - sample group has the largest influence.
First, we calculate the \(mean(C_P)\) of control samples:
(control_means <- cp_long_target %>%
dplyr::filter(sample_group == "control") %>%
dplyr::group_by(gene) %>%
dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)Then we join this table with our long data and calculate:
- \(\Delta \Delta C_P = \Delta C_{P, treatment} - \text{mean}(\Delta C_{P, control})\)
- \(log_2(\text{fold-change}) = -\Delta \Delta C_P\)
- \(\text{fold-change} = 2^{LFC}\)
cp_long_target <- cp_long_target %>%
dplyr::left_join(control_means, by = "gene") %>%
dplyr::mutate(
delta_delta_cp = delta_cp - delta_cp_control_mean,
lfc = -delta_delta_cp,
fc = 2^lfc
) %>%
dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())
head(cp_long_target)\(\text{mean}(\Delta \Delta C_{P, control})\) is now \(0\), because we subtracted \(\text{mean}(\Delta C_P, control)\) also from the control samples. Of course this also applies for \(LFC\), which is just \(-\Delta \Delta C_P\). As you will see, this is useful for plotting (boxplots).
Boxplots π
Boxplots are very informative when you want to see a difference between groups.
Function to be implemented within the assignment. It should work for any type of long data, as you will be able to specify the grouping column to split boxplots on x-axe, and column with values to plot on y-axe. Tip: you can use ggpubr::ggboxplot()
First we will plot the \(\Delta C_P\) values:
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "delta_cp",
facet_by = "gene",
color_by = "sample_group",
main = "delta Cp values",
y_lab = "delta Cp"
)And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated! Also, you cannot directly see the \(LFCs\) - in your head, you have to calculate the difference between spirulina and control and flip the sign.
So better is to plot the \(LFCs\). Why not the \(\text{fold-changes}\)? Because \(LFCs\) are easier to read, as changes (relative to controls) are multiplicative and symmetric:
| \(\text{fold-change}\) | 0.125 | 0.25 | 0.5 | 1 | 2 | 4 | 8 |
| \(log_2(\text{fold-change})\) | -3 | -2 | -1 | 0 | 1 | 2 | 3 |
| You read: Change is⦠| 8x less | 4x less | 2x less | no change | 2x more | 4x more | 8x more |
This figure summarises how \(LFCs\) were calculated and why it is better for boxplots π€
Now you can clearly see that spirulina group have upregulated genes:
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "lfc",
facet_by = "gene",
color_by = "sample_group",
main = "log2 fold-changes",
y_lab = "log2 fold-change"
)We can also split plots by confluence:
plot_boxplots(
cp_long_target,
x = "sample_group",
y = "lfc",
facet_by = "gene",
color_by = "confluence",
main = "log2 fold-changes",
y_lab = "log2 fold-change"
)Alternatively to boxplots, we can make barplots, which are quite popular in papers π
ggpubr::ggbarplot(
cp_long_target,
x = "sample_group",
y = "fc",
add = c("mean", "jitter"),
fill = "sample_group",
facet.by = "gene"
) +
scale_y_continuous("Relative concentration", labels = scales::percent) +
labs(x = NULL, fill = "Sample Group")Or we can plot a heatmap of LFC values or their z-scores:
lfc_wide <- pivot_wider(cp_long_target, id_cols = c(sample_name, gene), names_from = sample_name, values_from = lfc) %>%
column_to_rownames(var = "gene")
lfc_wide[, 1:5]plot_heatmap(
lfc_wide,
column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
title = "LFC between spirulina and control",
legend_title = "LFC",
cluster_rows = FALSE,
cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)plot_heatmap(
lfc_wide,
z_score = TRUE,
column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
title = "LFC between spirulina and control",
legend_title = "LFC z-score",
cluster_rows = FALSE,
cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)t-test π
Functions to be implemented within the assignment. You should implement both
test_gene()andtest_gene_table(). If you do it smart, you can just call the former from the latter. Hint: look at theformulaparameter oft.testfunction.
t-test is testing whether two means, coming from samples having Studentβs distribution, significantly differ. T-test assumes that the input \(C_P\) values are normally distributed and the variance between conditions is comparable. Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.
As we will use two-sided alternative (i.e.Β gene is up- or downregulated = deregulated), it doesnβt matter which one of \(\Delta C_P\), \(\Delta \Delta C_P\) or \(LFC\) we will use. In all cases, the absolute difference between group means remains the same:
group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
dplyr::group_by(sample_group) %>%
dplyr::summarise(
delta_cp_mean = mean(delta_cp),
delta_delta_cp_mean = mean(delta_delta_cp),
lfc_mean = mean(lfc)
) %>%
as.data.frame() %>%
tibble::column_to_rownames("sample_group")
group_meansapply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))## delta_cp_mean delta_delta_cp_mean lfc_mean
## 1.057847 1.057847 1.057847
for (g in target_genes) {
test_gene(
gene = g,
gene_data = cp_long_target,
gene_col = "gene",
value_col = "lfc",
group_col = "sample_group",
test = t.test,
verbose = TRUE
)
}## CD24: spirulina vs. control
## t.test p-value: 0.000959 ***
## CD31: spirulina vs. control
## t.test p-value: 0.00535 **
## CXCR4: spirulina vs. control
## t.test p-value: 0.168 NS
## VEGFA: spirulina vs. control
## t.test p-value: 0.00234 **
But better is to return a table for all genes:
(test_table <- test_gene_table(
gene_data = cp_long_target,
gene_col = "gene",
value_col = "delta_cp",
group_col = "sample_group"
))Cleanup
save.image(here(BASE_DIR, "qPCR.RData"))
warnings()
traceback()## No traceback available
sessioninfo::session_info()## β Session info ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## setting value
## version R version 4.0.4 (2021-02-15)
## os Debian GNU/Linux 10 (buster)
## system x86_64, linux-gnu
## ui RStudio
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-03-29
##
## β Packages ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## package * version date lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.0.4)
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.4)
## backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.4)
## BiocGenerics 0.36.0 2020-10-27 [1] Bioconductor
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.4)
## broom 0.7.5 2021-02-19 [1] CRAN (R 4.0.4)
## bslib 0.2.4 2021-01-25 [1] CRAN (R 4.0.4)
## cachem 1.0.4 2021-02-13 [1] CRAN (R 4.0.4)
## Cairo 1.5-12.2 2020-07-07 [1] CRAN (R 4.0.4)
## car 3.0-10 2020-09-29 [1] CRAN (R 4.0.4)
## carData 3.0-4 2020-05-22 [1] CRAN (R 4.0.4)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.4)
## circlize 0.4.12 2021-01-08 [1] CRAN (R 4.0.4)
## cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.4)
## clue 0.3-58 2020-12-03 [1] CRAN (R 4.0.4)
## cluster 2.1.1 2021-02-14 [3] CRAN (R 4.0.4)
## codetools 0.2-18 2020-11-04 [3] CRAN (R 4.0.4)
## colorspace 2.0-0 2020-11-11 [1] CRAN (R 4.0.4)
## ComplexHeatmap * 2.6.2 2020-11-12 [1] Bioconductor
## conflicted * 1.0.4 2019-06-21 [1] CRAN (R 4.0.4)
## crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4)
## crosstalk 1.1.1 2021-01-12 [1] CRAN (R 4.0.4)
## curl 4.3 2019-12-02 [1] CRAN (R 4.0.4)
## data.table 1.14.0 2021-02-21 [1] CRAN (R 4.0.4)
## DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.4)
## dbplyr 2.1.0 2021-02-03 [1] CRAN (R 4.0.4)
## debugme 1.1.0 2017-10-22 [1] CRAN (R 4.0.4)
## dendextend 1.14.0 2020-08-26 [1] CRAN (R 4.0.4)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.4)
## dplyr * 1.0.5 2021-03-05 [1] CRAN (R 4.0.4)
## ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.4)
## emo * 0.0.0.9000 2021-02-26 [1] Github (hadley/emo@3f03b11)
## evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.4)
## fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.4)
## farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.4)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.4)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.0.4)
## foreach 1.5.1 2020-10-15 [1] CRAN (R 4.0.4)
## foreign 0.8-81 2020-12-22 [3] CRAN (R 4.0.4)
## friendlyeval * 0.1.1 2021-02-26 [1] Github (milesmcbain/friendlyeval@7ea2ed9)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.4)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.4)
## GetoptLong 1.0.5 2020-12-15 [1] CRAN (R 4.0.4)
## GGally 2.1.0 2021-01-06 [1] CRAN (R 4.0.4)
## ggcorrplot 0.1.3 2019-05-19 [1] CRAN (R 4.0.4)
## ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 4.0.4)
## ggpubr 0.4.0 2020-06-27 [1] CRAN (R 4.0.4)
## ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.0.4)
## ggsignif 0.6.1 2021-02-23 [1] CRAN (R 4.0.4)
## GlobalOptions 0.1.2 2020-06-10 [1] CRAN (R 4.0.4)
## glue * 1.4.2 2020-08-27 [1] CRAN (R 4.0.4)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 4.0.4)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.4)
## haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.4)
## heatmaply 1.2.1 2021-02-02 [1] CRAN (R 4.0.4)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.0.4)
## highr 0.8 2019-03-20 [1] CRAN (R 4.0.4)
## hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.4)
## htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.4)
## htmlwidgets 1.5.3 2020-12-10 [1] CRAN (R 4.0.4)
## httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.4)
## IRanges 2.24.1 2020-12-12 [1] Bioconductor
## iterators 1.0.13 2020-10-15 [1] CRAN (R 4.0.4)
## jquerylib 0.1.3 2020-12-17 [1] CRAN (R 4.0.4)
## jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.4)
## klippy * 0.0.0.9500 2021-02-26 [1] Github (rlesur/klippy@378c247)
## knitr * 1.31 2021-01-27 [1] CRAN (R 4.0.4)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.4)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
## limma 3.46.0 2020-10-27 [1] Bioconductor
## lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.0.4)
## magick 2.6.0 2021-01-13 [1] CRAN (R 4.0.4)
## magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.0.4)
## matrixStats 0.58.0 2021-01-29 [1] CRAN (R 4.0.4)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.4)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.4)
## openxlsx 4.2.3 2020-10-27 [1] CRAN (R 4.0.4)
## patchwork * 1.1.1 2020-12-17 [1] CRAN (R 4.0.4)
## pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.4)
## plotly 4.9.3 2021-01-10 [1] CRAN (R 4.0.4)
## plyr 1.8.6 2020-03-03 [1] CRAN (R 4.0.4)
## png 0.1-7 2013-12-03 [1] CRAN (R 4.0.4)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.4)
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.4)
## RColorBrewer * 1.1-2 2014-12-07 [1] CRAN (R 4.0.4)
## Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.4)
## readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.4)
## readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.4)
## registry 0.5-1 2019-03-05 [1] CRAN (R 4.0.4)
## reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.4)
## reshape 0.8.8 2018-10-23 [1] CRAN (R 4.0.4)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.0.4)
## rio 0.5.26 2021-03-01 [1] CRAN (R 4.0.4)
## rjson 0.2.20 2018-06-08 [1] CRAN (R 4.0.4)
## rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.4)
## rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4)
## rmdformats * 1.0.1 2021-01-13 [1] CRAN (R 4.0.4)
## rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.4)
## rstatix 0.7.0 2021-02-13 [1] CRAN (R 4.0.4)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.4)
## rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.4)
## S4Vectors 0.28.1 2020-12-09 [1] Bioconductor
## sass 0.3.1 2021-01-24 [1] CRAN (R 4.0.4)
## scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.4)
## seriation 1.2-9 2020-10-01 [1] CRAN (R 4.0.4)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.4)
## shape 1.4.5 2020-09-13 [1] CRAN (R 4.0.4)
## stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.4)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.4)
## tibble * 3.1.0 2021-02-25 [1] CRAN (R 4.0.4)
## tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.0.4)
## tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.4)
## tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.4)
## TSP 1.1-10 2020-04-17 [1] CRAN (R 4.0.4)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.4)
## vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.4)
## viridis 0.5.1 2018-03-29 [1] CRAN (R 4.0.4)
## viridisLite 0.3.0 2018-02-01 [1] CRAN (R 4.0.4)
## webshot 0.5.2 2019-11-22 [1] CRAN (R 4.0.4)
## withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.4)
## xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4)
## xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.4)
## yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.4)
## zip 2.1.1 2020-08-27 [1] CRAN (R 4.0.4)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/lib/R/site-library
## [3] /usr/lib/R/library
HTML rendering
This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell π€―
library(conflicted)
library(knitr)
library(glue)
library(here)
library(emo)
if (!require(rmdformats)) {
BiocManager::install("rmdformats")
}
# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = TRUE)
rmarkdown::render(here("E03-qPCR/qPCR.Rmd"), output_file = here("E03-qPCR/qPCR.html"), envir = new.env())